Loading Packages

In [ ]:
# Reading in, manipulations
import numpy as np
import pandas as pd

# Plotting
import plotly.express as px  # plotly express has been bundled with plotly since v4
import plotly.graph_objects as go
import plotly.io as pio
from plotly.subplots import make_subplots

# Clustering
from sklearn.preprocessing import PowerTransformer
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Other
import warnings
warnings.filterwarnings(action = "ignore")

Setting my theme and color scheme

In [ ]:
pio.templates.default = "plotly_white"
pio.renderers.default = "notebook"
custom_color_scale = ['#0000FF', '#33CCFF', '#66FF99', '#D2D200', '#FFFF00']

Reading data

In [ ]:
df = pd.read_csv("data/cc_general.csv")

EDA

In [ ]:
df.head()
Out[ ]:
CUST_ID BALANCE BALANCE_FREQUENCY PURCHASES ONEOFF_PURCHASES INSTALLMENTS_PURCHASES CASH_ADVANCE PURCHASES_FREQUENCY ONEOFF_PURCHASES_FREQUENCY PURCHASES_INSTALLMENTS_FREQUENCY CASH_ADVANCE_FREQUENCY CASH_ADVANCE_TRX PURCHASES_TRX CREDIT_LIMIT PAYMENTS MINIMUM_PAYMENTS PRC_FULL_PAYMENT TENURE
0 C10001 40.900749 0.818182 95.40 0.00 95.4 0.000000 0.166667 0.000000 0.083333 0.000000 0 2 1000.0 201.802084 139.509787 0.000000 12
1 C10002 3202.467416 0.909091 0.00 0.00 0.0 6442.945483 0.000000 0.000000 0.000000 0.250000 4 0 7000.0 4103.032597 1072.340217 0.222222 12
2 C10003 2495.148862 1.000000 773.17 773.17 0.0 0.000000 1.000000 1.000000 0.000000 0.000000 0 12 7500.0 622.066742 627.284787 0.000000 12
3 C10004 1666.670542 0.636364 1499.00 1499.00 0.0 205.788017 0.083333 0.083333 0.000000 0.083333 1 1 7500.0 0.000000 NaN 0.000000 12
4 C10005 817.714335 1.000000 16.00 16.00 0.0 0.000000 0.083333 0.083333 0.000000 0.000000 0 1 1200.0 678.334763 244.791237 0.000000 12

Making columns lowercase

In [ ]:
df.columns = df.columns.str.lower()

Checking for NAs

In [ ]:
df.isnull().sum().sort_values(ascending = False)
Out[ ]:
minimum_payments                    313
credit_limit                          1
cust_id                               0
balance                               0
prc_full_payment                      0
payments                              0
purchases_trx                         0
cash_advance_trx                      0
cash_advance_frequency                0
purchases_installments_frequency      0
oneoff_purchases_frequency            0
purchases_frequency                   0
cash_advance                          0
installments_purchases                0
oneoff_purchases                      0
purchases                             0
balance_frequency                     0
tenure                                0
dtype: int64

We've got some missing values; we'll deal with them later. Let's check the distributions ...

In [ ]:
df.describe()
Out[ ]:
balance balance_frequency purchases oneoff_purchases installments_purchases cash_advance purchases_frequency oneoff_purchases_frequency purchases_installments_frequency cash_advance_frequency cash_advance_trx purchases_trx credit_limit payments minimum_payments prc_full_payment tenure
count 8950.000000 8950.000000 8950.000000 8950.000000 8950.000000 8950.000000 8950.000000 8950.000000 8950.000000 8950.000000 8950.000000 8950.000000 8949.000000 8950.000000 8637.000000 8950.000000 8950.000000
mean 1564.474828 0.877271 1003.204834 592.437371 411.067645 978.871112 0.490351 0.202458 0.364437 0.135144 3.248827 14.709832 4494.449450 1733.143852 864.206542 0.153715 11.517318
std 2081.531879 0.236904 2136.634782 1659.887917 904.338115 2097.163877 0.401371 0.298336 0.397448 0.200121 6.824647 24.857649 3638.815725 2895.063757 2372.446607 0.292499 1.338331
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 50.000000 0.000000 0.019163 0.000000 6.000000
25% 128.281915 0.888889 39.635000 0.000000 0.000000 0.000000 0.083333 0.000000 0.000000 0.000000 0.000000 1.000000 1600.000000 383.276166 169.123707 0.000000 12.000000
50% 873.385231 1.000000 361.280000 38.000000 89.000000 0.000000 0.500000 0.083333 0.166667 0.000000 0.000000 7.000000 3000.000000 856.901546 312.343947 0.000000 12.000000
75% 2054.140036 1.000000 1110.130000 577.405000 468.637500 1113.821139 0.916667 0.300000 0.750000 0.222222 4.000000 17.000000 6500.000000 1901.134317 825.485459 0.142857 12.000000
max 19043.138560 1.000000 49039.570000 40761.250000 22500.000000 47137.211760 1.000000 1.000000 1.000000 1.500000 123.000000 358.000000 30000.000000 50721.483360 76406.207520 1.000000 12.000000

There are quite a few outliers!

In [ ]:
df.drop(['cust_id'], axis = 1, inplace = True)

Removing the customer ID as we will not need it.

In [ ]:
fig = px.box(df)
fig.update_layout(
    title_text = 'Not only are there many outliers, but the ranges differ across variables.'
)
fig.show()

We will need to standardize the data later for clustering.
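To see why, here is a tiny sketch with made-up numbers: without scaling, the Euclidean distance that both PCA and k-means rely on is dominated entirely by the large-scale feature. The standard deviations used below are illustrative, not taken from the dataset.

```python
import numpy as np

# Two hypothetical customers: balance differs by 1000, purchase frequency by 0.9
a = np.array([1000.0, 0.1])   # [balance, purchases_frequency]
b = np.array([2000.0, 1.0])

# Unscaled: the balance gap dwarfs the frequency gap
raw_dist = np.linalg.norm(a - b)

# Scaled by assumed standard deviations (2000 for balance, 0.4 for frequency):
# now both dimensions contribute to the distance
scaled_diff = (a - b) / np.array([2000.0, 0.4])
scaled_dist = np.linalg.norm(scaled_diff)

print(round(raw_dist, 2))     # 1000.0: the frequency difference is invisible
print(round(scaled_dist, 2))  # 2.3: both features now matter
```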

In [ ]:
nr_columns = len(df.columns)
nr_rows = (nr_columns + 1) // 2
nr_cols = 2

fig = make_subplots(rows = nr_rows, cols = nr_cols, subplot_titles = df.columns)

for i, col in enumerate(df.columns):
    row_num = (i // nr_cols) + 1 
    col_num = (i % nr_cols) + 1   
    fig.add_trace(go.Histogram(x = df[col], nbinsx = 30, name = str(col)), row = row_num, col = col_num)

fig.update_layout(height = 900, width = 600, showlegend = False)
fig.update_xaxes(tickfont = dict(size = 8))
fig.update_yaxes(tickfont = dict(size = 8))

fig.show()
In [ ]:
df[df['purchases'] == 0] 
Out[ ]:
balance balance_frequency purchases oneoff_purchases installments_purchases cash_advance purchases_frequency oneoff_purchases_frequency purchases_installments_frequency cash_advance_frequency cash_advance_trx purchases_trx credit_limit payments minimum_payments prc_full_payment tenure
1 3202.467416 0.909091 0.0 0.0 0.0 6442.945483 0.0 0.0 0.0 0.250000 4 0 7000.0 4103.032597 1072.340217 0.222222 12
14 2772.772734 1.000000 0.0 0.0 0.0 346.811390 0.0 0.0 0.0 0.083333 1 0 3000.0 805.647974 989.962866 0.000000 12
16 2072.074354 0.875000 0.0 0.0 0.0 2784.274703 0.0 0.0 0.0 0.250000 3 0 3000.0 391.974562 376.579631 0.000000 8
24 5368.571219 1.000000 0.0 0.0 0.0 798.949863 0.0 0.0 0.0 0.363636 4 0 6000.0 1422.726707 1657.002877 0.000000 11
35 1656.350781 1.000000 0.0 0.0 0.0 99.264367 0.0 0.0 0.0 0.083333 1 0 1800.0 351.500199 634.569311 0.000000 12
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
8920 1055.087681 0.666667 0.0 0.0 0.0 1820.116200 0.0 0.0 0.0 0.333333 2 0 2000.0 54.795084 233.229870 0.000000 6
8929 371.527312 0.333333 0.0 0.0 0.0 1465.407927 0.0 0.0 0.0 0.166667 5 0 1500.0 0.000000 NaN 0.000000 6
8937 163.001629 0.666667 0.0 0.0 0.0 274.440466 0.0 0.0 0.0 0.166667 3 0 1000.0 117.738787 60.792450 0.000000 6
8938 78.818407 0.500000 0.0 0.0 0.0 1113.186078 0.0 0.0 0.0 0.166667 7 0 1200.0 1397.770131 21.821194 0.333333 6
8948 13.457564 0.833333 0.0 0.0 0.0 36.558778 0.0 0.0 0.0 0.166667 2 0 500.0 52.549959 55.755628 0.250000 6

2044 rows × 17 columns

  • We've got customers with no purchases whose balance nonetheless went down due to cash advances. This could also be because of fees or interest.
  • Also, the balance does not seem to match up when we add together payments and cash advances, so it could be that the balance reflects the previous billing cycle.

OK, let's start by looking at the most important variable when it comes to credit cards: the credit limit.

In [ ]:
fig = px.histogram(df, x = 'credit_limit', nbins = 100)
fig.show()

As expected, most limits are low.

In [ ]:
fig = px.scatter(df, x = 'credit_limit', y = 'payments')
fig.show()

So we can see ...

  • we've got a variety of purchase behaviors, and ...
  • a higher credit limit does not imply higher payment amounts.
In [ ]:
fig = px.scatter(df, x = 'purchases', y = 'payments', color = 'cash_advance', trendline = 'ols')
fig.show()
  • But the more purchases, the more likely the payments are high.
  • As noticed before, quite a few customers make no purchases; this is partly explained by cash advances, while the rest cannot be explained from this dataset.

It's still a bit fuzzy, let's move on to correlations ...

In [ ]:
correlation_matrix = df.corr()
fig = go.Figure(
    data = go.Heatmap(
            x = correlation_matrix.columns,
            y = correlation_matrix.index,
            z = correlation_matrix.values,
            colorscale = 'YlGnBu')
)

fig.show()

Variables that directly impact each other are, of course, highly correlated, such as ...

  • ... credit limit and balance
  • ... purchases frequency and one-off purchases frequency
  • ... purchases and installments purchases
  • ... etc.

That means quite a few variables will probably not be needed to cluster our customers.
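As a side note, the redundant pairs can also be listed programmatically. A sketch (demonstrated on a small synthetic frame, since it only needs `DataFrame.corr`):

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(frame, n = 3):
    """Return the n strongest absolute correlations, counting each pair once."""
    corr = frame.corr().abs()
    mask = np.triu(np.ones(corr.shape, dtype = bool), k = 1)  # upper triangle only
    return corr.where(mask).stack().sort_values(ascending = False).head(n)

# Synthetic demo: y is a noisy copy of x, z is independent noise
rng = np.random.default_rng(0)
x = rng.normal(size = 200)
demo = pd.DataFrame({'x': x, 'y': x + rng.normal(scale = 0.1, size = 200), 'z': rng.normal(size = 200)})
print(top_correlated_pairs(demo, n = 1))  # the (x, y) pair comes out on top
```

On the real data this would be called as `top_correlated_pairs(df)`.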

To really understand where the most variation comes from, I will use PCA. But before that, our features need some pre-processing!

Pre-processing

  1. Imputing missing values: the distribution is right-skewed, so we will use the median to impute the missing values.
  2. Dropping the one row that has a missing value in credit_limit.
In [ ]:
fig = px.histogram(df, x = 'minimum_payments', nbins = 50)
fig.show()
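The histogram confirms the right skew, which is exactly why the median is the safer imputation value: unlike the mean, it is not dragged upward by the long tail. A made-up example:

```python
import numpy as np

# Toy right-skewed sample: one large value drags the mean far above the typical value
payments = np.array([100.0, 120.0, 150.0, 180.0, 5000.0])

print(np.mean(payments))    # 1110.0: inflated by the outlier
print(np.median(payments))  # 150.0: a far more typical value to impute
```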
In [ ]:
df.dropna(subset = ['credit_limit'], inplace = True)
# Assign back rather than calling fillna(inplace = True) on a column slice
df['minimum_payments'] = df['minimum_payments'].fillna(df['minimum_payments'].median())
print(df.isnull().sum())
balance                             0
balance_frequency                   0
purchases                           0
oneoff_purchases                    0
installments_purchases              0
cash_advance                        0
purchases_frequency                 0
oneoff_purchases_frequency          0
purchases_installments_frequency    0
cash_advance_frequency              0
cash_advance_trx                    0
purchases_trx                       0
credit_limit                        0
payments                            0
minimum_payments                    0
prc_full_payment                    0
tenure                              0
dtype: int64

As we have observed, most variables are skewed, so we need to ...

  1. Normalize the variables with PowerTransformer
  2. Store it separately for clustering

P.S. Using PowerTransformer for (1) since we have a variety of distributions (right-skewed, left-skewed, and bimodal) and the features have significantly different scales and variances.
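A quick sketch on synthetic log-normal data of what PowerTransformer does: it removes most of the skew and standardizes to zero mean and unit variance in one step.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

def sample_skew(x):
    """Sample skewness: the third standardized moment."""
    return np.mean((x - x.mean()) ** 3) / x.std() ** 3

rng = np.random.default_rng(42)
raw = rng.lognormal(mean = 0.0, sigma = 1.0, size = (2000, 1))  # strongly right-skewed

pt = PowerTransformer()  # Yeo-Johnson by default, with standardize = True
transformed = pt.fit_transform(raw)

print(round(sample_skew(raw.ravel()), 2))          # large positive skew
print(round(sample_skew(transformed.ravel()), 2))  # close to zero
```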

In [ ]:
scaler = PowerTransformer()
X = scaler.fit_transform(df)
# Chain both checks so the cell's output reflects mean ~0 AND std ~1
np.allclose(np.mean(X), 0) and np.allclose(np.std(X), 1)
Out[ ]:
True

A superfluous step, but just so the reader knows that the data is clearly normalized. Finally, it's time for PCA!

PCA

  1. Use all variables
  2. Check the highest variance explained
  3. Plot loadings
In [ ]:
pca = PCA(n_components = None)
X_pca = pca.fit_transform(X)
# Note: the columns of this frame are principal components; reusing the
# original feature names is only for convenience, they are not those features.
X_pca_df = pd.DataFrame(X_pca, columns = df.columns)
print(X_pca_df)
       balance  balance_frequency  purchases  oneoff_purchases  \
0    -0.256402          -3.117240  -0.212140         -1.039162   
1    -3.480136           0.875292  -0.180773          1.335073   
2     1.337092           1.187997   2.191266         -1.742006   
3    -1.373930          -0.533114   2.053472         -0.580049   
4    -1.009702          -1.406326   1.019393         -1.981967   
...        ...                ...        ...               ...   
8944  1.863231          -3.006702  -1.740143          1.166673   
8945  1.296106          -2.132573  -1.919821         -0.560237   
8946  1.282313          -3.657642  -1.268103          1.104514   
8947 -2.809793          -3.916708   0.107539          1.826599   
8948 -0.347167          -0.562417   2.151005          1.528104   

      installments_purchases  cash_advance  purchases_frequency  \
0                   0.060030      0.155493            -0.679016   
1                   2.105084     -0.295592             0.679257   
2                   0.009938     -0.361336             0.290415   
3                  -0.582991     -0.741878            -2.211714   
4                   0.287524      0.600037             0.931382   
...                      ...           ...                  ...   
8944               -1.596956     -0.006751             2.150340   
8945               -2.341109     -0.883149             1.208490   
8946               -1.827778     -0.695729             1.084660   
8947               -1.963909      0.650321             1.174312   
8948               -3.911358      0.124713            -0.796946   

      oneoff_purchases_frequency  purchases_installments_frequency  \
0                       0.873231                          0.303741   
1                       0.268776                         -0.647957   
2                      -0.722455                         -0.272510   
3                      -0.887097                         -1.522410   
4                       0.426215                          0.624754   
...                          ...                               ...   
8944                   -0.779580                          0.499237   
8945                    0.028846                          0.568643   
8946                   -0.165126                         -0.362678   
8947                   -0.233841                         -0.353835   
8948                   -0.029382                         -0.248305   

      cash_advance_frequency  cash_advance_trx  purchases_trx  credit_limit  \
0                  -0.386776          0.164891       0.089532      0.632001   
1                   0.133657         -0.360332      -0.357819     -0.154056   
2                   1.218358          0.169826      -0.495254      0.046993   
3                  -1.112262          0.836977      -0.973376     -0.912798   
4                  -0.056678         -0.291900      -0.165492     -0.003175   
...                      ...               ...            ...           ...   
8944                0.013688          0.100195      -0.079311     -0.261552   
8945                0.481943          0.226928       1.089301     -0.119652   
8946                0.169422         -0.104195      -0.238034     -0.237678   
8947               -0.300750         -0.121237      -0.046715      0.186797   
8948                0.697003          0.623269      -0.723173      0.390070   

      payments  minimum_payments  prc_full_payment    tenure  
0     0.106505         -0.178227         -0.133516 -0.023747  
1     0.055915         -0.001065          0.002851 -0.136091  
2    -0.083213         -0.499905         -0.251716 -0.011544  
3     0.127374         -0.273642          0.196251 -0.074892  
4    -0.107770          0.108535         -0.018660  0.011435  
...        ...               ...               ...       ...  
8944 -0.032893         -0.253187         -0.032960  0.057837  
8945  0.107551         -0.306901          0.000984  0.031057  
8946 -0.106563         -0.164333         -0.033685  0.040150  
8947 -0.302203          0.136741          0.030968  0.067428  
8948 -0.371654          0.177702         -0.021312 -0.365567  

[8949 rows x 17 columns]
In [ ]:
fig = go.Figure()
fig.add_trace(go.Scatter(
    x = list(range(1, len(pca.explained_variance_) + 1)),
    y = pca.explained_variance_,
    marker = dict(color = 'blue')
))

fig.update_layout(
    title = "The bend occurs at eigenvalue number 3.",
    xaxis = dict(title = "Eigenvalue number"),
    yaxis = dict(title = "Eigenvalue size"),
)

fig.show()

This could be a signal that the first 3 components are enough. Let's look at the explained variance.

In [ ]:
exp_var_cumul = np.cumsum(pca.explained_variance_ratio_)

fig = px.area(
    x = range(1, exp_var_cumul.shape[0] + 1),
    y = exp_var_cumul,
    labels = {"x": "# Components", "y": "Explained Variance"}
)
fig.show()

Around 5 components are needed to keep 80% of the explained variance.
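The cut-off can also be computed directly from `explained_variance_ratio_` rather than read off the plot. A sketch on synthetic data with a few dominant directions:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in for the scaled features: 10 columns with decaying variance
X_demo = rng.normal(size = (500, 10)) * np.array([5, 4, 3, 2, 1, 0.5, 0.5, 0.5, 0.5, 0.5])

pca_demo = PCA().fit(X_demo)
cumul = np.cumsum(pca_demo.explained_variance_ratio_)
n_for_80 = int(np.argmax(cumul >= 0.80)) + 1  # first component count crossing 80%
print(n_for_80)
```

Applied to the fitted `pca` above, the same expression gives the number of components needed for 80%.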

In [ ]:
features = df.columns
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)

fig = px.scatter(X_pca, x = 0, y = 1)

for i, feature in enumerate(features):
    # Arrow from the origin to this feature's loading on PC1 (x) and PC2 (y)
    fig.add_annotation(
        ax = 0, ay = 0,
        axref = "x", ayref = "y",
        x = loadings[i, 0],
        y = loadings[i, 1],
        arrowsize = 0.5,
        arrowhead = 1,
        xanchor = "right",
        yanchor = "top")

    # Label: loadings[i, 0] is the PC1 loading, loadings[i, 1] the PC2 loading
    fig.add_annotation(
        x = loadings[i, 0],
        y = loadings[i, 1],
        ax = 0, ay = 0,
        text = f"{feature} (PC1): {loadings[i, 0]:.2f} and (PC2): {loadings[i, 1]:.2f}",
        yshift = -15
    )

fig.show()
  • We can notice one clear cluster on the left side, separate from the big cluster.
  • The features that contribute most to PC1 are balance and minimum payments, whereas for PC2 (the y-axis) the largest contributors are a mix of purchases, installments, and cash advances.
  • Variables that are close together, such as purchases & purchases_trx or oneoff_purchases_frequency & oneoff_purchases, are positively correlated.


I will use 3 PCA components as features for the k-means, as they explain enough of the variance without adding too much complexity.

In [ ]:
pca = PCA(n_components = 3)
X_pca = pca.fit_transform(X)

K-Means

In [ ]:
kmeans_models = [KMeans(n_clusters = k, random_state = 1, verbose = False).fit(X_pca) for k in range(1, 10)]
inertia = [model.inertia_ for model in kmeans_models]

fig = go.Figure()
fig.add_trace(go.Scatter(
              x = list(range(1, len(inertia) + 1)),  # must match the number of fitted models (k = 1 .. 9)
              y = inertia, 
              mode = 'lines+markers', 
              marker = dict(size = 10), line = dict(color = 'blue'))
              )
fig.update_layout(title = 'The elbow is around 5 clusters.',
                  xaxis = dict(title = 'Number of Clusters'),
                  yaxis = dict(title = 'WCSS'))

fig.show()

The inertia drop is minimal after K = 5, so 5 can be considered the optimal number of clusters.
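The same conclusion can be made explicit by computing the relative inertia drop at each step. A sketch on synthetic blobs (five well-separated clusters, so the expected elbow is known):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Five clearly separated synthetic clusters stand in for the PCA features
centers = [[-6, -6], [-6, 6], [6, -6], [6, 6], [0, 0]]
X_demo, _ = make_blobs(n_samples = 600, centers = centers, cluster_std = 0.6, random_state = 1)

inertias = [KMeans(n_clusters = k, random_state = 1, n_init = 10).fit(X_demo).inertia_
            for k in range(1, 9)]
drops = [1 - after / before for before, after in zip(inertias, inertias[1:])]

for k, d in enumerate(drops, start = 2):
    print(f"going to k = {k} cuts inertia by {d:.0%}")
# The drop collapses once k passes the true number of clusters (5)
```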

In [ ]:
from sklearn.metrics import silhouette_score

silhouette_scores = [silhouette_score(X_pca, model.labels_) for model in kmeans_models[1:5]]
for k, score in enumerate(silhouette_scores, start = 2):
    print(f"Number of clusters: {k}")
    print(f"Silhouette score: {round(score, 3)}")
    print()
Number of clusters: 2
Silhouette score: 0.377

Number of clusters: 3
Silhouette score: 0.367

Number of clusters: 4
Silhouette score: 0.386

Number of clusters: 5
Silhouette score: 0.401

We will move on with 5 clusters. Let's see how it looks.
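For completeness, the "pick the k with the best silhouette" step can be automated. A sketch on synthetic blobs with a known answer (four clusters):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

centers = [[0, 0], [6, 0], [0, 6], [6, 6]]  # four well-separated synthetic clusters
X_demo, _ = make_blobs(n_samples = 500, centers = centers, cluster_std = 0.5, random_state = 7)

scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters = k, random_state = 7, n_init = 10).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

best_k = max(scores, key = scores.get)
print(best_k)  # 4, matching the known structure
```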

In [ ]:
kmeans = KMeans(n_clusters = 5, random_state = 23)
kmeans.fit(X_pca)
Out[ ]:
KMeans(n_clusters=5, random_state=23)

Adding the clusters into the dataframe ...

In [ ]:
df_pca_kmeans = pd.concat([df.reset_index(drop = True), pd.DataFrame(X_pca)], axis = 1)
df_pca_kmeans.columns.values[-3: ] = ['PC_1', 'PC_2', 'PC_3']
df_pca_kmeans = df_pca_kmeans.reset_index(drop = True)
df_pca_kmeans['cluster_id'] = kmeans.labels_
cluster_names = {0: 1, 1: 2, 2: 3, 3: 4, 4: 5}
df_pca_kmeans['cluster'] = df_pca_kmeans['cluster_id'].map(cluster_names)
In [ ]:
x_axis = df_pca_kmeans['PC_1']
y_axis = df_pca_kmeans['PC_2']
fig = px.scatter(x = x_axis, y = y_axis, color = df_pca_kmeans['cluster'], color_continuous_scale = custom_color_scale)
fig.show()

The model is not perfect, but the points do not form super clear clusters, as there is quite a lot of variance.

Introducing the third component ...

In [ ]:
z_axis = df_pca_kmeans['PC_3']
fig = px.scatter_3d(x = x_axis, y = y_axis, z = z_axis, color = df_pca_kmeans['cluster'], color_continuous_scale = custom_color_scale)
fig.show()

Now, let's go back to the original data and see what that actually means.

In [ ]:
fig = px.box(df_pca_kmeans, x = 'cluster', y = 'credit_limit')
fig.update_layout(
    title_text = 'Credit limit is higher for clusters 1 and 2.'
)
fig.show()

We can see that the first two clusters are the ones with the higher limit. Would we assume they also spend more?

In [ ]:
fig = px.scatter(df_pca_kmeans, x = 'credit_limit', y = 'purchases', color = 'cluster', color_continuous_scale = custom_color_scale)
fig.update_layout(
    title_text = 'Clusters 4 and 5 are not big purchasers.'
)
fig.show()

Yes, that is definitely one additional differentiator. The lower part contains customers that do not make a lot of purchases: clusters 4 and 5.

In [ ]:
fig = px.scatter(df_pca_kmeans, x = 'credit_limit', y = 'balance', color = 'cluster',  color_continuous_scale = custom_color_scale)
fig.update_layout(
    title_text = '... especially customers in Cluster 4 do not use it for much.'
)
fig.show()

Now we can see that the last two clusters (4, 5) are customers that do not generate much activity. But for the rest, this is not a clear differentiator.
Where does it come from then? Let's check cash advances, as we have already seen quite a few of them.

In [ ]:
fig = px.box(df_pca_kmeans, x = 'cluster', y = 'cash_advance', log_y = True)
fig.update_layout(
    title_text = '... Cash advances are a favourite for clusters 1 and 3.'
)
fig.show()

Here is where cluster 3 comes in: these are customers who have an OK limit but use it for quite a few cash advances.

Let's go back to our PCA view, and observe the impact by certain variables.

In [ ]:
x_axis = df_pca_kmeans['PC_1']
y_axis = df_pca_kmeans['PC_2']
fig = px.scatter(x = x_axis, y = y_axis, color = df_pca_kmeans['credit_limit'], color_continuous_scale = custom_color_scale)
fig.show()

Now, if we recall our clusters, we can already see that the lower part is mostly low credit limit and the upper part of the y-axis is higher limit.
But this still explains only a small part of the picture.

This variable mainly separates out the lower cluster, the 5th: the low-limit customers that do not generate much activity.

In [ ]:
fig = px.scatter(x = x_axis, y = y_axis, color = df_pca_kmeans['purchases_frequency'], color_continuous_scale = custom_color_scale)
fig.show()

Now we are getting somewhere.

  • You see how the cluster on the left side is clearly far away from all the others? This is the no- or almost-no-usage group.
  • On the other side, we've got the high-purchase group.
In [ ]:
df_cluster_metrics = df_pca_kmeans.groupby('cluster').describe().T.reset_index().rename(columns = {'level_0': 'Column', 'level_1': 'Metrics'})
In [ ]:
df_cluster_profile = df_cluster_metrics[df_cluster_metrics['Metrics'] == '50%'].set_index('Column').reset_index()
df_cluster_profile.style.background_gradient(cmap = 'YlGnBu', axis = 1).hide_index()
Out[ ]:
Column Metrics 1 2 3 4 5
balance 50% 2705.591995 756.857540 1460.077343 164.634007 61.115414
balance_frequency 50% 1.000000 1.000000 1.000000 0.857143 1.000000
purchases 50% 824.770000 2077.825000 0.000000 368.405000 350.000000
oneoff_purchases 50% 375.250000 1132.320000 0.000000 324.830000 0.000000
installments_purchases 50% 289.250000 669.750000 0.000000 0.000000 335.620000
cash_advance 50% 1574.545965 0.000000 1241.559904 0.000000 0.000000
purchases_frequency 50% 0.750000 1.000000 0.000000 0.250000 0.833333
oneoff_purchases_frequency 50% 0.200000 0.583333 0.000000 0.166667 0.000000
purchases_installments_frequency 50% 0.500000 0.750000 0.000000 0.000000 0.750000
cash_advance_frequency 50% 0.250000 0.000000 0.250000 0.000000 0.000000
cash_advance_trx 50% 5.000000 0.000000 4.000000 0.000000 0.000000
purchases_trx 50% 13.000000 29.000000 0.000000 4.000000 10.000000
credit_limit 50% 5000.000000 6000.000000 3000.000000 3000.000000 2000.000000
payments 50% 1664.863771 1866.978859 729.064946 533.568794 422.709468
minimum_payments 50% 1026.828021 248.785651 475.102035 183.180968 165.163582
prc_full_payment 50% 0.000000 0.000000 0.000000 0.000000 0.142857
tenure 50% 12.000000 12.000000 12.000000 12.000000 12.000000
PC_1 50% -0.014990 3.043762 -3.468358 0.130777 1.657189
PC_2 50% 2.550159 0.888861 0.017824 -1.102028 -1.875139
PC_3 50% -0.383794 0.451095 -0.271980 1.958574 -1.382261
cluster_id 50% 0.000000 1.000000 2.000000 3.000000 4.000000
  • Cluster 1 (Active Cash Advance Customers): They have a higher limit than most and use it for purchasing. Additionally, they make use of installments and some cash advances.

  • Cluster 2 (All-in Active Customers): They have a higher limit than Cluster 1, purchase even more frequently, and make heavier use of installments.

  • Cluster 3 (Cash Advance Customers): They have a lower limit and use a higher proportion of it for cash advances than the others.

  • Cluster 4 (Dead Customers): They do not do much. They do not buy, and they do not take cash advances.

  • Cluster 5 (Installment Customers): They prefer safety and therefore make purchases via installments, and do not take any cash advances.

So what kind of marketing could the bank target depending on the customer?

Marketing Strategies

Cluster 1 (Active Cash Advance Customers):

  • How: Provide discounts or rewards for specific purchase categories.
  • Why: Maintain the interest.

Cluster 2 (All-in Active Customers):

  • How: Introduce exclusive rewards for high-frequency transactions.
  • Why: This group has the money and likes to spend it.

Cluster 3 (Cash Advance Customers):

  • How: Explore other credit card types, e.g. a lower cash-advance allowance but lower fees and longer tenure.
  • Why: Too frequent cash advances are risky for a bank. In the best case, this group of customers would decrease in size and move to the Active Customer group.

Cluster 4 (Dead Customers):

  • How: Consider contacting customers with surveys to understand their disengagement reasons.
  • Why: This group would need to be understood first before targeting with a marketing technique. Why don't they purchase? Are these people with lower income, are they young?

Cluster 5 (Installment Customers):

  • How: Develop partnerships with merchants offering installment-friendly services, potentially with low interest.
  • Why: It is scalable across other clusters e.g. dead customers.